Using Association Rules of the Online Retail Dataset

1 Load Data

We first want to load our datasets and prepare them for some simple association rules mining.

tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")

tnx_data_tbl %>% glimpse()
## Rows: 1,044,848
## Columns: 23
## $ row_id            <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet       <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id        <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code        <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description       <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity          <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date      <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price             <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id       <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country           <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr    <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm      <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month     <fct> December, December, December, December, December, De…
## $ invoice_dow       <fct> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday…
## $ invoice_dom       <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour      <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute    <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy       <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym        <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value       <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

To use our rules mining we just need the invoice data and the stock code, so we can ignore the rest. Also, we ignore the issue of returns and just look at purchases.

tnx_purchase_tbl <- tnx_data_tbl %>%
  filter(
    quantity > 0,
    price > 0,
    exclude == FALSE
    ) %>%
  select(
    invoice_id, stock_code, customer_id, quantity, price, stock_value,
    description
    )

tnx_purchase_tbl %>% glimpse()
## Rows: 1,015,088
## Columns: 7
## $ invoice_id  <chr> "489434", "489434", "489434", "489434", "489434", "489434"…
## $ stock_code  <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "2…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "130…
## $ quantity    <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3,…
## $ price       <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.75…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, 3…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHTS…

We now write this data out as a CSV so arules can read it in and process it.

tnx_purchase_tbl %>% write_csv("data/tnx_purchase_tbl.csv")

We also want to load the free-text description of the various stock items as this will help will interpretation of the various rules we have found.

stock_description_tbl <- read_rds("data/stock_description_tbl.rds")

stock_description_tbl %>% glimpse()
## Rows: 4,725
## Columns: 2
## $ stock_code <chr> "10002", "10002R", "10080", "10109", "10120", "10123C", "10…
## $ desc       <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", "GRO…

2 Basket Analysis with Association Rules

We now want to do some basic basket analysis using association rules, which tries to determine which items get bought together, similar to taking a graph approachin many ways.

basket_arules <- read.transactions(
    file   = "data/tnx_purchase_tbl.csv",
    format = "single",
    sep    = ",",
    header = TRUE,
    cols   = c("invoice_id", "stock_code")
    )

basket_arules %>% glimpse()
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   ..@ itemInfo   :'data.frame':  4895 obs. of  1 variable:
##   .. ..$ labels: chr [1:4895] "10002" "10002R" "10080" "10109" ...
##   ..@ itemsetInfo:'data.frame':  39516 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:39516] "489434" "489435" "489436" "489437" ...

Now that we have this data we can look at some basic plots much like we produced before. For example, we can look at the relative frequency of the different items.

itemFrequencyPlot(basket_arules, topN = 20)

itemFrequencyPlot(basket_arules, topN = 20, type = "absolute")

The stock codes do not mean a huge amount to us, so we also want to look at the description field for these items.

freq_codes <- itemFrequency(basket_arules) %>%
  sort(decreasing = TRUE) %>%
  head(20) %>%
  names()

stock_description_tbl %>%
  filter(stock_code %in% freq_codes) %>%
  arrange(stock_code) %>%
  datatable()

2.1 Basic Concepts

The basic ideas of association rule mining and basket analysis draws on basic ideas from probability theory.

We speak in terms of the itemset: that is, a collection of one or more items that co-occur in a transaction.

For example, suppose we have a list of transactions as follows:

ID Items
1 milk, bread
2 bread, butter
3 beer
4 milk, bread, butter
5 bread, butter

Using the above set of transactions, and itemset may be “milk” or “bread, butter”.

The support of an itemset \(X\), \(\text{Supp}(X)\), is defined as the proportion of transactions in the dataset which contain the itemset.

In the above example:

\[ \text{Supp}(\text{\{milk, bread\}}) = \frac{2}{5} = 0.40. \]

A rule, \(X \Rightarrow Y\), between two itemsets \(X\) and \(Y\) is a directed relationship of the itemset \(X\) showing the presence of \(Y\). The rule is not symmetric: \(X \Rightarrow Y\) and \(Y \Rightarrow X\) are not the same.

The confidence for the rule \(X \implies Y\), \(\text{Conf}(X \Rightarrow Y)\) is defined by

\[ \text{Conf}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X)}. \]

So, to calculate the confidence for a rule:

\[ \text{Conf}(\text{\{milk, bread\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{0.4} = 0.5. \]

To illustrate how rules are not symmetric:

\[ \text{Conf}(\text{\{butter\}} \Rightarrow \text{\{milk, bread\}}) = \frac{0.2}{0.6} = 0.33. \]

Finally, we want a measure of the strength of the relationship between the itemsets \(X\) and \(Y\). That is, measuring the effect of the presence of \(X\) on the presence of \(Y\). We measure this by defining the lift of a rule as

\[ \text{Lift}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X) \text{Supp}(Y)}. \]

Again, we repeat our calculations for our rule.

\[ \text{Lift}(\text{\{bread, milk\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{(0.4)(0.6)} = \frac{0.2}{0.24} = 0.8333 \]

Lift values greater than 1 implies the presence of \(X\) increases the probability of \(Y\) being present when compared to the unconditional probability.

Now that we have these metrics and concepts, we can turn our attention to trying to find rules in a given dataset, using these metrics to rank them.

Rather than using brute-force approaches to discovering these rules, we use a number of different algorithms to find associations within the dataset.

The two main algorithms for discovering some rules are the apriori and the eclat algorithms.

2.2 Construct apriori Rules

We now want to construct the association rules using the apriori algorithm. To do this, we need to set parameters such as the minimum support and the minimum confidence level.

This gives us a set of association rules, along with the support and lift.

basket_apriori <- apriori(
    basket_arules,
    parameter = list(supp = 0.005, conf = 0.8)
    )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 197 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4895 item(s), 39516 transaction(s)] done [0.35s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.19s].
## writing ... [565 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
basket_apriori_tbl <- basket_apriori %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))

basket_apriori_tbl %>% glimpse()
## Rows: 565
## Columns: 6
## $ rules      <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "{2…
## $ support    <dbl> 0.005390222, 0.005187772, 0.005390222, 0.005086547, 0.00528…
## $ confidence <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.93…
## $ coverage   <dbl> 0.005693896, 0.005415528, 0.005668590, 0.005415528, 0.00556…
## $ lift       <dbl> 150.2349, 150.2147, 149.1091, 149.0582, 148.9690, 148.7356,…
## $ count      <int> 213, 205, 213, 201, 209, 209, 205, 204, 204, 203, 203, 232,…

We now want to inspect this table using the ruleExplorer()

basket_apriori %>% ruleExplorer()

To help visualise these rules, we can produce a basic scatterplot of the metrics.

ggplot(basket_apriori_tbl) +
  geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
  xlab("Rule Confidence") +
  ylab("Rule Lift") +
  ggtitle("Scatterplot of Association Rule Metrics")

2.3 Construct eclat Rules

An alternative method of constructing association rules is to use the eclat algorithm. The code for doing this is slightly different, but gives us similar outputs.

basket_eclat <- eclat(
    basket_arules,
    parameter = list(support = 0.005)
    ) %>%
  ruleInduction(
    basket_arules,
    confidence = 0.8
    )
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.005      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 197 
## 
## create itemset ... 
## set transactions ...[4895 item(s), 39516 transaction(s)] done [0.35s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating sparse bit matrix ... [1445 row(s), 39516 column(s)] done [0.01s].
## writing  ... [7774 set(s)] done [4.17s].
## Creating S4 object  ... done [0.00s].
basket_eclat_tbl <- basket_eclat %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))

basket_eclat_tbl %>% glimpse()
## Rows: 565
## Columns: 5
## $ rules      <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "{2…
## $ support    <dbl> 0.005390222, 0.005187772, 0.005390222, 0.005086547, 0.00528…
## $ confidence <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.93…
## $ lift       <dbl> 150.2349, 150.2147, 149.1091, 149.0582, 148.9690, 148.7356,…
## $ itemset    <int> 69, 41, 69, 39, 73, 73, 41, 26, 26, 38, 68, 76, 76, 40, 69,…

Once again, we inspect the data using ruleExplorer()

basket_eclat %>% ruleExplorer()

2.4 Compare Algorithms

We now want to compare the outputs of both algorithms in terms of association rules and how they compare.

basket_ap_tbl <- basket_apriori_tbl %>%
  select(rules, support, confidence, lift)

basket_ec_tbl <- basket_eclat_tbl %>%
  select(rules, support, confidence, lift)

rules_comparison_tbl <- basket_ap_tbl %>%
  full_join(basket_ec_tbl, by = "rules", suffix = c("_a", "_e"))

rules_comparison_tbl %>% glimpse()
## Rows: 565
## Columns: 7
## $ rules        <chr> "{22917,22918} => {22916}", "{22916,22919} => {22917}", "…
## $ support_a    <dbl> 0.005390222, 0.005187772, 0.005390222, 0.005086547, 0.005…
## $ confidence_a <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.…
## $ lift_a       <dbl> 150.2349, 150.2147, 149.1091, 149.0582, 148.9690, 148.735…
## $ support_e    <dbl> 0.005390222, 0.005187772, 0.005390222, 0.005086547, 0.005…
## $ confidence_e <dbl> 0.9466667, 0.9579439, 0.9508929, 0.9392523, 0.9500000, 0.…
## $ lift_e       <dbl> 150.2349, 150.2147, 149.1091, 149.0582, 148.9690, 148.735…

2.5 Reducing Minimum Confidence

While high confidence rules are useful, they are more likely to find rules that are “obvious” as the probabilities are such that co-occuring basket items will be noticed as being together - or possibly be natural complements: butter, milk and bread is a good example.

Instead, we are also interested in less obvious rules, and so we reduce our confidence threshold and see how many additional rules are discovered.

basket_lower_rules <- apriori(
    basket_arules,
    parameter = list(supp = 0.005, conf = 0.4)
  )
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 197 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4895 item(s), 39516 transaction(s)] done [0.36s].
## sorting and recoding items ... [1445 item(s)] done [0.03s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.19s].
## writing ... [6039 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].
basket_lower_rules_tbl <- basket_lower_rules %>%
  as("data.frame") %>%
  as_tibble() %>%
  arrange(desc(lift))
ggplot(basket_lower_rules_tbl) +
  geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
  xlab("Rule Confidence") +
  ylab("Rule Lift") +
  ggtitle("Scatterplot of Association Rule Metrics")

3 Converting Rules to Graphs

We also have the ability to convert these rules to a graph representation, where each node is either a stock_code or a rule, with the edges of the graph representing that item being contained in the rule.

apriori_rules_igraph <- basket_apriori %>%
  plot(
    measure = "support",
    method  = "graph",
    engine  = "igraph",
    control = list(max = 1000)
    )

apriori_rules_igraph %>% print()
## IGRAPH f582eea DN-B 708 2125 -- 
## + attr: name (v/c), label (v/c), type (v/n), support (v/n), confidence
## | (v/n), coverage (v/n), lift (v/n), count (v/n), order (v/n)
## + edges from f582eea (vertex names):
##  [1] 1732->assoc1  1255->assoc2  2120->assoc3  2117->assoc4  2120->assoc5 
##  [6] 2115->assoc6  2120->assoc7  2118->assoc8  2117->assoc9  2118->assoc10
## [11] 2119->assoc11 2118->assoc12 2115->assoc13 2118->assoc14 2116->assoc15
## [16] 2117->assoc16 2119->assoc17 2117->assoc18 2115->assoc19 2117->assoc20
## [21] 2116->assoc21 2119->assoc22 2115->assoc23 2119->assoc24 2116->assoc25
## [26] 2115->assoc26 2116->assoc27 2360->assoc28 2360->assoc29 1950->assoc30
## [31] 1950->assoc31 1950->assoc32 3748->assoc33 2359->assoc34 1949->assoc35
## + ... omitted several edges

We should first visualise this graph, using the top 100 rules in the dataset, as measured by the support of the rule.

basket_apriori %>%
  head(n = 100, by = "support") %>%
  plot(
    measure  = "lift",
    method   = "graph",
    engine   = "htmlwidget"
    )

3.1 Extract Connected Product Labels

First we want to look at the different disjoint components of the graph, and label them with an ID.

apriori_rules_tblgraph <- apriori_rules_igraph %>%
  igraph::as.undirected(mode = "collapse") %>%
  as_tbl_graph() %>%
  mutate(
    component_id = group_components()
    ) %>%
  group_by(component_id) %>%
  mutate(
    component_size = n()
    ) %>%
  ungroup()

We then want to create groups of common products that form a disjoint cluster within this graph.

product_groups_all_tbl <- apriori_rules_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(component_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(
    product_group_id = component_id,
    product_count,
    stock_code = label
    ) %>%
  arrange(product_group_id, stock_code)

product_groups_all_tbl %>% glimpse()
## Rows: 143
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count    <int> 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 6…
## $ stock_code       <chr> "20711", "20712", "20713", "20718", "20719", "20723",…

For display purposes, we can show all the stock_id values in a list.

3.1.1 Cluster Larger Groups

Within the large disjoint cluster there are a large number of products so rather than treating this as a single group we instead may investigate using further graph clustering algorithms to create further groupings.

apriori_rules_large_tblgraph <- apriori_rules_tblgraph %>%
  convert(to_subgraph, component_size == max(component_size)) %>%
  morph(to_undirected) %>%
  mutate(
    sub_id = group_louvain()
    ) %>%
  unmorph()

Now that we have sub-divided this large subgraph, we repeat the process.

product_groups_largest_tbl <- apriori_rules_large_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(sub_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(
    product_group_id = sub_id, product_count, stock_code = label
    ) %>%
  arrange(product_group_id, stock_code)

product_groups_largest_tbl %>% glimpse()
## Rows: 68
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ product_count    <int> 8, 8, 8, 8, 8, 8, 8, 8, 12, 12, 12, 12, 12, 12, 12, 1…
## $ stock_code       <chr> "20719", "20723", "20728", "21212", "21213", "21976",…

Finally, it is worth trying to use an interactive tool to investigate this subgraph, we we can use visNetwork() to produce an interactive JS tool

apriori_rules_large_tblgraph %>%
  toVisNetworkData(idToLabel = FALSE) %>%
  visNetwork(
    nodes = .$nodes %>% transmute(id, label, group = sub_id),
    edges = .$edges
    )

3.2 Evaluating Product Groups

How do we go about assessing the validity of these product groups?

Note that this work is exploratory - in effect this is more sophisticated data exploration. Rather than use this model to make predictions - a job we will need to do at some point, we instead just want to assess how novel these grouping are.

To that end, it may be useful to check the co-occurrence of these products as a group - for each purchase made by a customer, what proportion of the group was featured in this data?

This question is worth exploring, so we should write some code to assess this.

Before we do this, we combine our two lists of product groups into a single table.

product_groups_tbl <- list(
    ALL = product_groups_all_tbl,
    LRG = product_groups_largest_tbl
    ) %>%
  bind_rows(.id = "type") %>%
  mutate(
    group_label = sprintf("%s_%02d", type, product_group_id)
    ) %>%
  group_by(group_label) %>%
  mutate(
    group_size  = n()
    ) %>%
  ungroup() %>%
  select(group_label, group_size, stock_code)

product_groups_tbl %>% glimpse()
## Rows: 211
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size  <int> 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68…
## $ stock_code  <chr> "20711", "20712", "20713", "20718", "20719", "20723", "207…
tnx_groups_tbl <- tnx_data_tbl %>%
  select(invoice_id, invoice_date, stock_code) %>%
  group_nest(invoice_id, .key = "invoice_data")

group_props_tbl <- product_groups_tbl %>%
  group_nest(group_label, group_size, .key = "stock_data") %>%
  filter(group_size > 1, group_size < 15) %>%
  expand_grid(tnx_groups_tbl) %>%
  mutate(
    comb_data = future_map2(
      invoice_data, stock_data,
      inner_join,
      by = "stock_code",
    
      .options = furrr_options(globals = FALSE)
      ),
    match_count = map_int(comb_data, nrow),
    group_prop  = match_count / group_size
    ) %>%
  select(group_label, group_size, group_prop) %>%
  filter(group_prop > 0)

group_props_tbl %>% glimpse()
## Rows: 54,457
## Columns: 3
## $ group_label <chr> "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02"…
## $ group_size  <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ group_prop  <dbl> 0.2, 0.6, 0.3, 0.3, 0.1, 0.7, 0.2, 0.4, 0.2, 0.4, 0.1, 0.2…

We now create a histogram of the proportions for each group, and this gives us a gauge of the ‘novelty’ of each of these groups.

plot_tbl <- group_props_tbl %>%
  mutate(label = glue("{group_label} ({group_size})"))

ggplot(plot_tbl) +
  geom_histogram(aes(x = group_prop), binwidth = 0.1) +
  facet_wrap(vars(label), scales = "free_y") +
  scale_y_continuous(labels = label_comma()) +
  xlab("Proportion") +
  ylab("Purchase Count") +
  ggtitle("Facetted Histograms of Group Coverages by Product Grouping") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

3.2.1 Investigate Groups

Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.

product_groups_tbl %>%
  filter(group_size > 1, group_size < 15) %>%
  mutate(stock_code = stock_code %>% str_trim() %>% str_to_upper()) %>%
  left_join(stock_description_tbl, by = "stock_code") %>%
  datatable()

4 Investigate Lower Support Rules

Our previous analysis was all based on rules with a minimum confidence of 0.80 so we now want to repeat our analysis but on this new set of rules.

apriori_lower_rules_igraph <- basket_lower_rules %>%
  plot(
    measure = "support",
    method  = "graph",
    engine  = "igraph",
    control = list(max = 5000)
    )

apriori_lower_rules_igraph %>% print()
## IGRAPH fb5899c DN-B 5460 15803 -- 
## + attr: name (v/c), label (v/c), type (v/n), support (v/n), confidence
## | (v/n), coverage (v/n), lift (v/n), count (v/n), order (v/n)
## + edges from fb5899c (vertex names):
##  [1] 258 ->assoc1  1581->assoc1  1582->assoc1  1605->assoc1  259 ->assoc2 
##  [6] 1581->assoc2  1582->assoc2  1605->assoc2  258 ->assoc3  259 ->assoc3 
## [11] 1582->assoc3  1605->assoc3  258 ->assoc4  259 ->assoc4  1581->assoc4 
## [16] 1605->assoc4  258 ->assoc5  259 ->assoc5  1581->assoc5  1582->assoc5 
## [21] 2149->assoc6  1361->assoc7  2392->assoc8  4103->assoc9  2484->assoc10
## [26] 1111->assoc11 1112->assoc12 861 ->assoc13 636 ->assoc14 639 ->assoc14
## [31] 220 ->assoc15 639 ->assoc15 220 ->assoc16 636 ->assoc16 636 ->assoc17
## + ... omitted several edges

Once again we have a quick look at the top 50 rules.

basket_lower_rules %>%
  head(n = 50, by = "support") %>%
  plot(
    measure  = "lift",
    method   = "graph",
    engine   = "htmlwidget"
    )

4.1 Determine Distinct Rules Subgraphs

Having converted the association rules to the graph, we then look at the distinct components of this graph and use these as our first pass at these clusters.

apriori_lower_rules_tblgraph <- apriori_lower_rules_igraph %>%
  igraph::as.undirected(mode = "collapse") %>%
  as_tbl_graph() %>%
  mutate(
    component_id = group_components()
    ) %>%
  group_by(component_id) %>%
  mutate(
    component_size = n()
    ) %>%
  ungroup()

apriori_lower_rules_tblgraph %>% print()
## # A tbl_graph: 5460 nodes and 15803 edges
## #
## # A bipartite simple graph with 88 components
## #
## # Node Data: 5,460 × 11 (active)
##   name  label  type support confidence coverage  lift count order component_id
##   <chr> <chr> <dbl>   <dbl>      <dbl>    <dbl> <dbl> <int> <int>        <int>
## 1 25    1505…     1      NA         NA       NA    NA    NA    NA           14
## 2 27    1505…     1      NA         NA       NA    NA    NA    NA           14
## 3 29    1505…     1      NA         NA       NA    NA    NA    NA           14
## 4 217   20674     1      NA         NA       NA    NA    NA    NA            2
## 5 218   20675     1      NA         NA       NA    NA    NA    NA            2
## 6 219   20676     1      NA         NA       NA    NA    NA    NA            2
## # … with 5,454 more rows, and 1 more variable: component_size <int>
## #
## # Edge Data: 15,803 × 2
##    from    to
##   <int> <int>
## 1    16   461
## 2    17   461
## 3   203   461
## # … with 15,800 more rows
product_groups_lower_all_tbl <- apriori_lower_rules_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(component_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(product_group_id = component_id, product_count, stock_code = label) %>%
  arrange(product_group_id, stock_code)

product_groups_lower_all_tbl %>% glimpse()
## Rows: 460
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count    <int> 215, 215, 215, 215, 215, 215, 215, 215, 215, 215, 215…
## $ stock_code       <chr> "20711", "20712", "20713", "20717", "20718", "20719",…
apriori_lower_rules_bigcomp_tblgraph <- apriori_lower_rules_tblgraph %>%
  convert(to_subgraph, component_size == max(component_size)) %>%
  mutate(
    sub_id = group_louvain()
    )

apriori_lower_rules_bigcomp_tblgraph %>% print()
## # A tbl_graph: 4453 nodes and 13834 edges
## #
## # A bipartite simple graph with 1 component
## #
## # Node Data: 4,453 × 13 (active)
##   name  label  type support confidence coverage  lift count order component_id
##   <chr> <chr> <dbl>   <dbl>      <dbl>    <dbl> <dbl> <int> <int>        <int>
## 1 247   20711     1      NA         NA       NA    NA    NA    NA            1
## 2 248   20712     1      NA         NA       NA    NA    NA    NA            1
## 3 249   20713     1      NA         NA       NA    NA    NA    NA            1
## 4 253   20717     1      NA         NA       NA    NA    NA    NA            1
## 5 254   20718     1      NA         NA       NA    NA    NA    NA            1
## 6 255   20719     1      NA         NA       NA    NA    NA    NA            1
## # … with 4,447 more rows, and 3 more variables: component_size <int>,
## #   .tidygraph_node_index <int>, sub_id <int>
## #
## # Edge Data: 13,834 × 3
##    from    to .tidygraph_edge_index
##   <int> <int>                 <int>
## 1     8   216                     1
## 2     9   216                     2
## 3    88   216                     3
## # … with 13,831 more rows
product_groups_lower_bigcomp_tbl <- apriori_lower_rules_bigcomp_tblgraph %>%
  activate(nodes) %>%
  as_tibble() %>%
  filter(are_na(support)) %>%
  group_by(sub_id) %>%
  mutate(
    product_count = n()
    ) %>%
  ungroup() %>%
  select(product_group_id = sub_id, product_count, stock_code = label) %>%
  arrange(product_group_id, stock_code)

product_groups_lower_bigcomp_tbl %>% glimpse()
## Rows: 215
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count    <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 2…
## $ stock_code       <chr> "20711", "20712", "20713", "21033", "21928", "21929",…
product_groups_lower_tbl <- list(
    ALL = product_groups_lower_all_tbl,
    LRG = product_groups_lower_bigcomp_tbl
    ) %>%
  bind_rows(.id = "type") %>%
  mutate(
    group_label = sprintf("%s_%02d", type, product_group_id)
    ) %>%
  group_by(group_label) %>%
  mutate(
    group_size = n()
    ) %>%
  ungroup() %>%
  select(group_label, group_size, stock_code)

product_groups_lower_tbl %>% glimpse()
## Rows: 675
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size  <int> 215, 215, 215, 215, 215, 215, 215, 215, 215, 215, 215, 215…
## $ stock_code  <chr> "20711", "20712", "20713", "20717", "20718", "20719", "207…

Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.

product_groups_lower_tbl %>%
  filter(group_size > 1, group_size != max(group_size)) %>%
  mutate(stock_code = stock_code %>% str_trim() %>% str_to_upper()) %>%
  left_join(stock_description_tbl, by = "stock_code") %>%
  datatable()

5 Output Data to Disk

We now want to write the various data groups to disk.

As this may be useful for later analysis and for later modelling, we output these groupings for later use.

product_groups_tbl       %>% write_rds("data/product_groups_tbl.rds")
product_groups_lower_tbl %>% write_rds("data/product_groups_lower_tbl.rds")

6 R Environment

sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.1 (2021-08-10)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2021-12-06                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  arules      * 1.6-8   2021-05-17 [1] RSPM (R 4.1.0)
##  arulesViz   * 1.5-0   2021-05-21 [1] RSPM (R 4.1.0)
##  assertthat    0.2.1   2019-03-21 [1] RSPM (R 4.1.0)
##  backports     1.3.0   2021-10-27 [1] RSPM (R 4.1.0)
##  bit           4.0.4   2020-08-04 [1] RSPM (R 4.1.0)
##  bit64         4.0.5   2020-08-30 [1] RSPM (R 4.1.0)
##  bookdown      0.24    2021-09-02 [1] RSPM (R 4.1.0)
##  broom         0.7.9   2021-07-27 [1] RSPM (R 4.1.0)
##  bslib         0.3.1   2021-10-06 [1] RSPM (R 4.1.0)
##  cachem        1.0.6   2021-08-19 [1] RSPM (R 4.1.0)
##  cellranger    1.1.0   2016-07-27 [1] RSPM (R 4.1.0)
##  cli           3.1.0   2021-10-27 [1] RSPM (R 4.1.0)
##  codetools     0.2-18  2020-11-04 [2] CRAN (R 4.1.1)
##  colorspace    2.0-2   2021-06-24 [1] RSPM (R 4.1.0)
##  conflicted  * 1.0.4   2019-06-21 [1] RSPM (R 4.1.0)
##  cowplot     * 1.1.1   2020-12-30 [1] RSPM (R 4.1.0)
##  crayon        1.4.1   2021-02-08 [1] RSPM (R 4.1.0)
##  crosstalk     1.1.1   2021-01-12 [1] RSPM (R 4.1.0)
##  DBI           1.1.1   2021-01-15 [1] RSPM (R 4.1.0)
##  dbplyr        2.1.1   2021-04-06 [1] RSPM (R 4.1.0)
##  digest        0.6.28  2021-09-23 [1] RSPM (R 4.1.0)
##  dplyr       * 1.0.7   2021-06-18 [1] RSPM (R 4.1.0)
##  DT          * 0.19    2021-09-02 [1] RSPM (R 4.1.0)
##  ellipsis      0.3.2   2021-04-29 [1] RSPM (R 4.1.0)
##  evaluate      0.14    2019-05-28 [1] RSPM (R 4.1.0)
##  fansi         0.5.0   2021-05-25 [1] RSPM (R 4.1.0)
##  farver        2.1.0   2021-02-28 [1] RSPM (R 4.1.0)
##  fastmap       1.1.0   2021-01-25 [1] RSPM (R 4.1.0)
##  forcats     * 0.5.1   2021-01-27 [1] RSPM (R 4.1.0)
##  fs            1.5.0   2020-07-31 [1] RSPM (R 4.1.0)
##  furrr       * 0.2.3   2021-06-25 [1] RSPM (R 4.1.0)
##  future      * 1.22.1  2021-08-25 [1] RSPM (R 4.1.0)
##  generics      0.1.1   2021-10-25 [1] RSPM (R 4.1.0)
##  ggplot2     * 3.3.5   2021-06-25 [1] RSPM (R 4.1.0)
##  globals       0.14.0  2020-11-22 [1] RSPM (R 4.1.0)
##  glue        * 1.4.2   2020-08-27 [1] RSPM (R 4.1.0)
##  gtable        0.3.0   2019-03-25 [1] RSPM (R 4.1.0)
##  haven         2.4.3   2021-08-04 [1] RSPM (R 4.1.0)
##  highr         0.9     2021-04-16 [1] RSPM (R 4.1.0)
##  hms           1.1.1   2021-09-26 [1] RSPM (R 4.1.0)
##  htmltools     0.5.2   2021-08-25 [1] RSPM (R 4.1.0)
##  htmlwidgets   1.5.4   2021-09-08 [1] RSPM (R 4.1.0)
##  httr          1.4.2   2020-07-20 [1] RSPM (R 4.1.0)
##  igraph        1.2.7   2021-10-15 [1] RSPM (R 4.1.0)
##  jquerylib     0.1.4   2021-04-26 [1] RSPM (R 4.1.0)
##  jsonlite      1.7.2   2020-12-09 [1] RSPM (R 4.1.0)
##  knitr         1.36    2021-09-29 [1] RSPM (R 4.1.0)
##  labeling      0.4.2   2020-10-20 [1] RSPM (R 4.1.0)
##  lattice       0.20-44 2021-05-02 [2] CRAN (R 4.1.1)
##  lifecycle     1.0.1   2021-09-24 [1] RSPM (R 4.1.0)
##  listenv       0.8.0   2019-12-05 [1] RSPM (R 4.1.0)
##  lubridate     1.8.0   2021-10-07 [1] RSPM (R 4.1.0)
##  magrittr    * 2.0.1   2020-11-17 [1] RSPM (R 4.1.0)
##  Matrix      * 1.3-4   2021-06-01 [2] CRAN (R 4.1.1)
##  modelr        0.1.8   2020-05-19 [1] RSPM (R 4.1.0)
##  munsell       0.5.0   2018-06-12 [1] RSPM (R 4.1.0)
##  parallelly    1.28.1  2021-09-09 [1] RSPM (R 4.1.0)
##  pillar        1.6.4   2021-10-18 [1] RSPM (R 4.1.0)
##  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.1.0)
##  purrr       * 0.3.4   2020-04-17 [1] RSPM (R 4.1.0)
##  R6            2.5.1   2021-08-19 [1] RSPM (R 4.1.0)
##  Rcpp          1.0.7   2021-07-07 [1] RSPM (R 4.1.0)
##  readr       * 2.0.2   2021-09-27 [1] RSPM (R 4.1.0)
##  readxl        1.3.1   2019-03-13 [1] RSPM (R 4.1.0)
##  reprex        2.0.1   2021-08-05 [1] RSPM (R 4.1.0)
##  rlang       * 0.4.12  2021-10-18 [1] RSPM (R 4.1.0)
##  rmarkdown     2.11    2021-09-14 [1] RSPM (R 4.1.0)
##  rmdformats    1.0.3   2021-10-06 [1] RSPM (R 4.1.0)
##  rstudioapi    0.13    2020-11-12 [1] RSPM (R 4.1.0)
##  rvest         1.0.2   2021-10-16 [1] RSPM (R 4.1.0)
##  sass          0.4.0   2021-05-12 [1] RSPM (R 4.1.0)
##  scales      * 1.1.1   2020-05-11 [1] RSPM (R 4.1.0)
##  sessioninfo   1.1.1   2018-11-05 [1] RSPM (R 4.1.0)
##  stringi       1.7.5   2021-10-04 [1] RSPM (R 4.1.0)
##  stringr     * 1.4.0   2019-02-10 [1] RSPM (R 4.1.0)
##  tibble      * 3.1.5   2021-09-30 [1] RSPM (R 4.1.0)
##  tidygraph   * 1.2.0   2020-05-12 [1] RSPM (R 4.1.0)
##  tidyr       * 1.1.4   2021-09-27 [1] RSPM (R 4.1.0)
##  tidyselect    1.1.1   2021-04-30 [1] RSPM (R 4.1.0)
##  tidyverse   * 1.3.1   2021-04-15 [1] RSPM (R 4.1.0)
##  tzdb          0.2.0   2021-10-27 [1] RSPM (R 4.1.0)
##  utf8          1.2.2   2021-07-24 [1] RSPM (R 4.1.0)
##  vctrs         0.3.8   2021-04-29 [1] RSPM (R 4.1.0)
##  visNetwork    2.1.0   2021-09-29 [1] RSPM (R 4.1.0)
##  vroom         1.5.5   2021-09-14 [1] RSPM (R 4.1.0)
##  withr         2.4.2   2021-04-18 [1] RSPM (R 4.1.0)
##  xfun          0.27    2021-10-18 [1] RSPM (R 4.1.0)
##  xml2          1.3.2   2020-04-23 [1] RSPM (R 4.1.0)
##  yaml          2.2.1   2020-02-01 [1] RSPM (R 4.1.0)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library